An effective topic model summarises the ideas and concepts within a document, and this summary can be used in several ways. A reader can identify the main themes within a corpus of documents and draw conclusions from an analysis of those topics, or treat the topics as a form of dimensionality reduction and feed them into downstream supervised or unsupervised algorithms.
In this project, our group used topic modelling to better understand the common topics that arise across the SONA speeches, how these relate to the different presidents and speeches, and how they change over time. In addition, the probability that a sentence belongs to a given topic was used in an attempt to classify which sentence was said by which president (see Section XX).
The data used in this section is the cleaned and processed data described in Section X. The resulting sentence data was used and dissected further without regard to the train/validation split unless otherwise stated.
The following methodology was followed:
Figure: Most popular terms
After tokenisation and removal of stop words, the top 20 most frequent terms across all of the SONA speeches are displayed. Unsurprisingly, “South Africa” is the most used term, followed closely by “South African”, “South Africans” and “Local Government”. These terms add nothing to our understanding of the topics and tend to confuse the topic modelling downstream, so removing them allows for a cleaner interpretation. “Public service” is then the most used term.
A prerequisite of topic modelling is knowing the number of topics that each corpus may contain (i.e. the latent factor k). In some cases this may be a fair assumption, but without reading through each speech, how could one know how many different topics have been articulated in the SONAs? Fortunately, Murzintcev Nikita has published an R package (ldatuning) that helps to optimise the number of topics k over three different measures. The measures used to determine the number of topics are discussed in an RPubs paper which can be found here: link, and the following optimisation largely follows the accompanying vignette: [link](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html)
The following extract from the RPub paper gives a brief explanation of the methods used to optimise for k:
Extract from RPubs
“Arun2010: The measure is computed in terms of the symmetric KL-divergence of salient distributions that are derived from the matrix factors; it is observed that the divergence values are higher for a non-optimal number of topics (minimize).
CaoJuan2009: A method of adaptively selecting the best LDA model based on topic density (minimize).
Griffiths2004: To evaluate the consequences of changing the number of topics T, the Gibbs sampling algorithm is used to obtain samples from the posterior distribution over z at several choices of T (maximize).”
In addition to this, Nikita considers how the choice of k may hold up over a validation or hold-out sample. His term for this is “perplexity”, which he defines as follows: “[it] measures the log-likelihood of a held-out test set; Perplexity is a measurement of how well a probability distribution or probability model predicts a sample”.
Below is an attempt to optimise for k and to check that the choice of k holds over an unseen data set.
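The search over k can be sketched with ldatuning’s `FindTopicsNumber`. The AssociatedPress document-term matrix shipped with topicmodels stands in for the SONA bigram dtm here (an assumption; substitute the real dtm), and the iteration count is kept small for speed.

```r
library(topicmodels)
library(ldatuning)

# Stand-in document-term matrix (assumption: the real analysis uses the SONA dtm)
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:40, ]

result <- FindTopicsNumber(
  dtm,
  topics   = seq(2, 6, by = 2),                             # candidate values of k
  metrics  = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
  method   = "Gibbs",
  control  = list(seed = 77, iter = 200),
  mc.cores = 1L
)

# Griffiths2004 is maximised; CaoJuan2009 and Arun2010 are minimised
FindTopicsNumber_plot(result)
```

The resulting data frame holds one row per candidate k with a column per metric, which `FindTopicsNumber_plot` rescales and plots on a common axis.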
Figure: Optimisation Metrics
From the above plot, the marginal benefit of adding another topic stops at around 8-10 topics. To test this, the “perplexity” of the document-term matrix over a held-out test sample can be checked.
Figure: Perplexity Plot
As more topics are used, the perplexity of the training sample decreases, but that of the test sample increases from around 11 topics. The perplexity of the test sample appears to be minimised at around 8 topics.
The evidence from these two plots suggests that the optimal number of topics sits at around 8.
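The held-out check can be sketched with topicmodels’ `perplexity()`, which VEM-fitted models support on new data. AssociatedPress again stands in for the SONA dtm (an assumption), and the train/test split sizes are illustrative.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm   <- AssociatedPress[1:60, ]
train <- dtm[1:45, ]    # fit on these documents
test  <- dtm[46:60, ]   # evaluate on these held-out documents

ks <- c(2, 5, 8, 11)
perp <- sapply(ks, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 123))
  c(train = perplexity(fit, train),
    test  = perplexity(fit, test))
})
colnames(perp) <- ks
round(perp, 1)
```

Training perplexity generally falls as k grows; the k at which test perplexity stops improving is the value of interest.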
For this assignment, Latent Dirichlet Allocation (LDA) was used for the topic modelling. Other methods, such as Latent Semantic Analysis (LSA) or Probabilistic Latent Semantic Analysis (pLSA), could have been used, but LDA is attractive because:

1. each document within the corpus is a mixture of topics;
2. each topic is a mixture of bigrams;
3. the topics are assumed to be drawn from a Dirichlet distribution (i.e. not k different distributions as with pLSA), so there are fewer parameters to estimate and no need to estimate the probability that the corpus generates a specific document.
The beta matrix produced gives the probability of each topic generating each bigram (i.e. that the phrase is in reference to that topic). From this measure, one can get a sense of the character of a topic: by examining the most probable phrases in each topic, an understanding of its flavour emerges. However, it must be kept in mind that terms can belong to more than one topic, so any logic applied to derive a theme or flavour should be applied loosely.
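Extracting the beta matrix in tidy form can be sketched as follows, fitting a k = 8 model to the AssociatedPress stand-in dtm (an assumption; the real model is fitted on the SONA bigram dtm).

```r
library(topicmodels)
library(tidytext)
library(dplyr)

data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:100, ], k = 8, control = list(seed = 1234))

# One row per (topic, term): beta = P(term | topic)
beta <- tidy(lda, matrix = "beta")

# The ten terms most likely to be generated by each topic
top_terms <- beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
```

The `top_terms` table is what drives the “most popular terms per topic” displays discussed below.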
From the display of popular terms, topic one has a vague connection to “job creation” - this is the most common term, supported by other terms with a high probability of being generated by this topic, such as:

+ “world cup”
+ “national youth”
+ “infrastructure development”
These concepts all support the idea of job creation, as each will generate jobs for the country. But there is some noise in the topic from “address terms”, i.e. “honourable speaker” or “honourable chairperson”. “Nelson Mandela” and “President Mandela” crop up too, which suggests that alongside the job-creation theme there exists some of what can be termed “terms of endearment”.
As with the previous topic, there are some stray “terms of endearment” in this topic as well (e.g. “madam speaker”), but they are not as evident as in the first topic. This is to be expected, as bigrams can be generated by more than one topic because each topic is a mixture of bigrams. The next four terms sum up the main themes of this topic:

+ “Economic Empowerment”
+ “Black Economic”
+ “Justice System”
+ “Criminal Justice”
In summary, this topic can be summed up as “Economy/Criminal and Justice System”.
Despite the most popular terms being “United Nations” and “private sector”, there is a theme of “development”: development plan, resource development, national development, development programme, and so on. The topic is named accordingly.
Once again, there is a “term of endearment” among the popular terms (“fellow south”, assumed to be short for “fellow South Africans”, one of former President Zuma’s favourite phrases). Combined with all the other terms, a theme of “Social Reform/Regional and Municipal Government” takes shape.
Given that a possible trigram is evident here, trigrams may be worth exploring in future work.
“Public sector” and “private sector” are popular terms in topic 5. After consideration of the various other terms, some of which cross over with other topics, and some discussion, the eventual name for this topic became “Public Sector Entities”.
A different way of looking at this topic is to investigate the biggest differential in terms between topics. For instance, the log (base 2) ratio of each term’s beta between topic 1 and topic 5 shows the terms with the widest margin between the two topics (i.e. terms far more likely to be generated by topic 5 than by topic 1).
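The log2 ratio can be sketched with the tidytext pattern below; the model is again fitted on the AssociatedPress stand-in dtm (an assumption), and the 0.001 beta cut-off for “reasonably common” terms is illustrative.

```r
library(topicmodels)
library(tidytext)
library(dplyr)
library(tidyr)

data("AssociatedPress", package = "topicmodels")
lda  <- LDA(AssociatedPress[1:100, ], k = 8, control = list(seed = 1234))
beta <- tidy(lda, matrix = "beta")

beta_ratio <- beta %>%
  filter(topic %in% c(1, 5)) %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic1 > 0.001 | topic5 > 0.001) %>%   # keep reasonably common terms
  mutate(log_ratio = log2(topic5 / topic1)) %>% # positive => favours topic 5
  arrange(desc(abs(log_ratio)))
```

Terms at the top of `beta_ratio` are the ones most strongly associated with one topic over the other.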
Figure: WordCloud for Topic 5
For instance, “social programmes”, “human fulfilment” and “rights commission” are all generated in significantly larger proportions by Topic 5 than by Topic 1, while “national social”, “training colleagues” and “sector unions” sit firmly within Topic 1.
Given the naming of Topic 5 as “Public Sector Entities” and Topic 1 as “Job Creation/Terms of Endearment” these terms do seem to be grouped in line with expectation.
The LDA model allows each sentence to be represented as a mixture of topics. The gamma matrix gives the document-topic probability for each sentence, i.e. the probability that the sentence is drawn from each topic. For instance, the following sentence, sampled at random, has a probability of 0.906 of being drawn from topic 4 based on the bigrams within it. The sentence appears to be talking about water and the infrastructure around it. The label for topic 4 was “Social Reform/Regional and Municipal Government”, and this statement seems somewhat relevant to it.
| president | year | sentence | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|---|---|---|
| Zuma | 2010 | yet, we still lose a lot of water through leaking pipes and inadequate infrastructure. | 0.023575 | 0.023575 | 0.023575 | 0.9057001 | 0.023575 |
Using this method, the sentences can be roughly classified into topics based on the probabilities (i.e. classify each sentence by the topic with the highest probability) and further analysis can be conducted.
(Note: which.is.max breaks ties at random, so where a sentence has equal probabilities it will be assigned to one of those topics at random.)
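The classification step can be sketched via the tidy gamma matrix, using `nnet::which.is.max` for its random tie-breaking as noted. AssociatedPress documents stand in for the SONA sentences (an assumption).

```r
library(topicmodels)
library(tidytext)
library(dplyr)
library(nnet)

data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:100, ], k = 8, control = list(seed = 1234))

# One row per (document, topic): gamma = P(topic | document)
gamma <- tidy(lda, matrix = "gamma")

# Assign each document to its most probable topic (ties broken at random)
assigned <- gamma %>%
  group_by(document) %>%
  summarise(topic = topic[which.is.max(gamma)], .groups = "drop")
```

The `assigned` table, one topic label per sentence, is what the per-president and over-time summaries below are built from.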
Figure: Topic proportions by president
Consider the mixture of topics that each president covers during the SONA address. Despite the imbalance in the number of sentences spoken by each president, there seems to be a fairly standard shape to the topics discussed. The two exceptions are de Klerk and Zuma. All other presidents tend to spend around 10-15% on Topic 1 (“Job Creation/Terms of Endearment”), 15-20% each on Topic 2 (“Economy/Criminal and Justice System”), Topic 3 (“Development”) and Topic 4 (“Social Reform/Regional and Municipal Government”), and around another 10% on Topic 5 (“Public Sector Entities”). This trend means it may be difficult for a supervised model to distinguish presidents based on the topics covered.
As stated, the only two presidents for whom this trend differs are President de Klerk and President Zuma. President de Klerk spent the majority of his time on Topic 1 (“Job Creation/Terms of Endearment”), followed by Topic 2 (“Economy/Criminal and Justice System”). Given the context of the time period, it may be unsurprising that “terms of endearment” and the “criminal and justice system” come up, since his speeches would be littered with the names of people and political parties as well as discussion of past injustices.
President Zuma spent the majority of his speeches on Topic 4 (“Social Reform/Regional and Municipal Government”). Once again, given that his terms as President were marked by service delivery strikes, two major droughts across several regions and discussions around reform, this may be unsurprising. In fact, when the most popular term from Topic 4 is recalled (“fellow south”), it may even be predictable that this would be the most talked-about topic for President Zuma. What is interesting is that, given the attention to the issues of State Capture that characterised Zuma’s presidency, his coverage of Topic 5 (“Public Sector Entities”) is much smaller than that of his peers.
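The per-president mixture can be sketched as a simple count-and-normalise. `sentence_topics` is a hypothetical data frame with one row per classified sentence and columns `president` and `topic` (names assumed, not taken from the report’s code).

```r
library(dplyr)
library(tibble)

# Hypothetical toy input: one row per classified sentence
sentence_topics <- tribble(
  ~president, ~topic,
  "Mandela",  1,
  "Mandela",  2,
  "Zuma",     4,
  "Zuma",     4,
  "Zuma",     5
)

# Proportion of each president's sentences falling in each topic
president_mix <- sentence_topics %>%
  count(president, topic) %>%
  group_by(president) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
```

Swapping `president` for a `year` column gives the over-time view discussed next.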
A similar analysis can be taken over time.
Figure: Topic proportions over time
The graph shows that, over time, Topics 1 and 5 are the least discussed, while Topics 2, 3 and 4 all get much the same airtime. There are a number of notable spikes and valleys:

+ In 1996, Topic 2 (“Economy/Criminal and Justice System”) spikes.
The 1996 SONA came a few months ahead of the introduction of the new constitution and coincided with the start of the Truth and Reconciliation Commission. It could be suggested that these two events drove this topic up in the SONA speech.
Mbeki’s presidency (1999-2008) was characterised by a rise in crime, specifically farm attacks, as well as the HIV/AIDS epidemic and the start of Black Economic Empowerment in 2005, which could account for the spikes and drops in topics around 2005.
From various media reports, Zuma’s 2012 SONA speech largely covered the successes of the government while skipping over future plans, which may be why Topic 4 (“Social Reform/Regional and Municipal Government”) rises sharply.
One of the aims of topic modelling is to reduce the dimensionality of the data so that other techniques can be applied. In this instance, the aim was to reduce the SONA speeches to a collection of topics that would help predict which president was responsible for a given sentence. The assumption was that each president might have a unique set or mixture of topics that characterises their speeches; however, there does not seem to be evidence of this. The matrix of probabilities of each sentence belonging to each topic is used in Section X, where the results are discussed.